Predictive performance of different NTCP techniques for radiation-induced esophagitis in NSCLC patients receiving proton radiotherapy

This study aimed to compare the predictive performance of different modeling methods in developing normal tissue complication probability (NTCP) models for predicting radiation-induced esophagitis (RE) in non–small cell lung cancer (NSCLC) patients receiving proton radiotherapy. The dataset was composed of 328 NSCLC patients receiving passive-scattering proton therapy and 41.6% of the patients experienced ≥ grade 2 RE. Five modeling methods were used to build NTCP models: standard Lyman–Kutcher–Burman (sLKB), generalized LKB (gLKB), multivariable logistic regression using two variable selection procedures-stepwise forward selection (Stepwise-MLR), and least absolute shrinkage and selection operator (LASSO-MLR), and support vector machines (SVM). Predictive performance was internally validated by a bootstrap approach for each modeling method. The overall performance, discriminative ability, and calibration were assessed using the Negelkerke R2, area under the receiver operator curve (AUC), and Hosmer–Lemeshow test, respectively. The LASSO-MLR model showed the best discriminative ability with an AUC value of 0.799 (95% confidence interval (CI): 0.763–0.854), and the best overall performance with a Negelkerke R2 value of 0.332 (95% CI: 0.266–0.486). Both of the optimism-corrected Negelkerke R2 values of the SVM and sLKB models were 0.301. The optimism-corrected AUC of the gLKB model (0.796) was higher than that of the SVM model (0.784). The sLKB model had the smallest optimism in the model variation and discriminative ability. In the context of classification and probability estimation for predicting the NTCP for radiation-induced esophagitis, the MLR model developed with LASSO provided the best predictive results. The simplest LKB modeling had similar or even better predictive performance than the most complex SVM modeling, and it was least likely to overfit the training data. The advanced machine learning approach might have limited applicability in clinical settings with a relatively small amount of data.

www.nature.com/scientificreports/ different approaches, ranging from the conventional analytic functions to the latest machine learning techniques 7 . And yet, there is no consensus on what is the best approach for NTCP modeling in the context of radiotherapy. Traditional NTCP models were developed using analytic functions based on different assumptions of the dose-response relationship [8][9][10][11][12][13] . Among the analytic NTCP models, the Lyman-Kutcher-Burman (LKB) is the most well-known and widely used method [11][12][13] . In addition, it is the only NTCP model available in clinical treatment planning systems for biological evaluation 14 . In the standard LKB (sLKB) model, only the dosimetric factor was considered and the three-dimensional dose distribution was simplified into a single measure. It has been found the performance of the sLKB model was improved by adding one or more dose-modifying variables that account for the effect of non-dosimetric factors [15][16][17] . The LKB model with dose-modifying variables was denoted as the generalized LKB (gLKB) model hereafter. Another classic modeling technique to deal with various types of predictors is multivariable logistic regression (MLR), a statistical method to predict a binary outcome based on a set of independent variables 18 . Compared with the sLKB model which used only one dosimetric measure, the MLR model is capable of handling multiple dose-volume metrics, such as maximum dose, mean dose, Vx values (the percentage volume receiving a dose greater than x Gy), and Dx values (the minimum dose given to x% of the volume) 19 . The sLKB model showed a slightly lower discriminative ability than the MLR model 20 , but the difference between the performance of the generalized gLKB model and that of the MLR model has not been reported.
In the era of big data, the application of supervised machine learning in the treatment response modeling for radiotherapy has been rapidly growing. Support vector machine (SVM) 21 , a commonly used supervised machine learning algorithm, has shown promising accuracy in predicting lung radiation-induced pneumonitis 22 , local tumor control after stereotactic body radiation therapy for early-stage NSCLC patients 23 , and other clinical outcomes 24,25 . With the preset feature selection, the results of the comparison in performance between SVM and MLR were inconsistent 26,27 . It is of interest to compare their predictive power using a technique-specific feature selection scheme. Moreover, the advantage or the disadvantage of the complex SVM model over the simple LKB model remains unknown.
In this study, we aimed to identify the value of increasing the complexity of the NTCP model and using the advanced modeling technique in clinical practice by comparing the performance of LKB, MLR, and SVM methods for a data set concerning esophagitis.

Materials and methods
Data set. The data set consists of 328 NSCLC patients treated with passive-scattering proton therapy at The University of Texas MD Anderson Cancer Center during April 2006 to February 2012, and 136 (41.5%) patients experienced grade 2 or higher radiation-induced esophagitis 20 . This retrospective data collection was approved by our institutional review board (The University of Texas MD Anderson Cancer Center) with waivers for the patient's informed consent. The prescription was 50-82.5 Gy [relative biological effectiveness (RBE)] in 25-37 fractions, with or without concurrent chemotherapy (CCRT). The clinical data were collected from the Epic system (Epic Systems Corporation), and the dosimetric data were extracted from the Eclipse treatment planning system (version 8.9, Varian Medical Systems) and has been converted to equivalent dose in 2-Gy fractions. Details of the patient characteristics, treatment, follow-up schedule, and esophagitis scoring were presented elsewhere 20 . The clinical characteristics included in the study were age (odds ratio (OR) = 0.99), sex (OR = 0.99), stage (OR = 2.23), and CCRT (OR = 4.76). There was no missing data on the outcome and predictors. All methods were performed in accordance with the relevant guidelines and regulations.

Modeling techniques. Standard Lyman-Kutcher-Burman modeling.
In the LKB model, NTCP is calculated using the probit formula coupled with a generalized equivalent uniform dose (EUD) dose-volume histogram-reduction scheme 28 . The sLKB model described the dose-response relationship characterized by three parameters, in which n represents the volume effect, m denotes the slope of the NTCP curve at TD 50 , and TD 50 is the dose tolerance corresponding to 50% complication risk (Supplementary material 1).
Generalized Lyman-Kutcher-Burman modeling. In the gLKB model, a dose-modifying factor was introduced to account for the effect of clinical features. The use of CCRT has been shown to be associated with a higher probability of developing grade ≥ 2 esophagitis in this cohort of patients 20 . Therefore, we substituted the parameter TD 50 with a group-specific TD 50s for the subgroup treated with (TD 50y ) or without (TD 50n ) CCRT, while keeping the same n, m for the entire cohort (Supplementary material 1). The dose-modifying factor is the ratio of TD 50y and TD 50n .
In the sLKB and gLKB modeling, the same methods were used to determine the best fits and the 95% confidence intervals (CI) for the parameters. The maximum likelihood estimation was used to determine the optimal values for the parameters n, m, TD 50, TD 50y, and TD 50n . The 95% CIs for the fitted parameters were obtained using the profile likelihood method 29,30 .
Multivariable logistic regression. In the multivariable logistic regression model, the NTCP is modeled as a logit transformation of a linear function of several prognostic variables (Supplementary material 1). The candidate prognostic variables for each patient include four clinical and 17 dosimetric parameters. The clinical variables were sex, race, age, stage, and the use of CCRT. The dosimetric variables were EUD, the maximum dose (D max ), the mean dose (D mean ), and the percentage volume receiving a dose higher than x Gy(RBE) (x ranges from 10 to 75 in increments of 5). The results of the correlation between variables are available elsewhere 20 . Table 1  www.nature.com/scientificreports/ For the stepwise feature selection, a forward stepwise algorithm was adopted with the likelihood ratio test. Starting with a null model, the stepwise regression added or removed one variable at each step based on the p value for the likelihood ratio test. The variable selection process stopped if no variable can be added or removed.
Stepwise-MLR is referred to as the multivariable logistic regression model using the stepwise feature selection in the following sections.
The LASSO method profiles out the insignificant variables by penalizing the regression coefficients of the variables and shrinks some of them to zero. As such, variables with non-zero coefficients were selected. The degree of penalty is controlled by the regularization parameter λ (Supplementary material 1). A fivefold cross-validation was performed to determine the optimal value of λ. The multivariable logistic regression model using LASSO regulation for feature selection is denoted as LASSO-MLR. Support vector machine. For binary classification, the SVM searches for a hyperplane that maximizes the margin between the two classes. In this study, a radial kernel function was used to map the data from a low-dimensional space into a high-dimensional feature space where the non-linear boundary became a linear boundary. Two hyperparameters were included in the SVM model. C is the regularization parameter that trades off the margin width against the fitting error, and γ is the parameter in the kernel function that controls the overfitting. The NTCP estimates were then generated from a decision function using a logit transformation (Supplementary material 1).
A grid-research was performed in the space of (log 2 C = − 5:2:15, log 2 γ = − 3: − 2: − 15) to identify the optimal parameter pair (C, γ) using fivefold cross validation 31 . The whole data set was divided into 5 subsets of approximately equal size. In each iteration, a model was trained in 80% of the sample using a pair of (C,γ) and tested in the remaining 20% subsets. The procedure was repeated 5 times. The selection criterion was the averaged value of the area under the receiving operator curve (AUC) computed in the testing samples. All grid points of (C, γ) were tested and the one yielding the maximum AUC was picked. The best parameters of C and γ were then used in the feature selection. The optimal subset of features was determined by sequentially adding features based on the criterion of the fivefold cross-validated AUC until there is no improvement. In the SVM modeling, a total of 24 variables were included, where stage was partitioned as a four-category attribute (1,0,0,0), (0,1,0,0), (0,0,1,0), and (0,0,0,1).

Bootstrapping for feature selection and model validation. The feature selection procedures in
Stepwise-MLR, LASSO-MLR, and SVM were repeated in 1000 bootstrap samples drawn from the replacement of the original sample. The features with a selection frequency of greater than 80% were picked for the final models. Another threshold was applied for the selection frequency if the highest selection frequency was lower than 80%. With the selected features, models were built on the original sample and the performance was evaluated (apparent performance). For SVM, the grid-search step was repeated to find the best pair (C, γ) within the selected feature space. www.nature.com/scientificreports/ We also used the bootstrap approach to internally validate the predictive performance of different NTCP models. Model fitting including feature selection and coefficient estimation was repeated on each bootstrap sample using the specific modeling technique. From the resulting model, the performance was evaluated in the bootstrap sample (bootstrap performance) and the original sample (test performance). The optimism was calculated as the difference between bootstrap performance and test performance. A lower value of optimism denotes a lower level of overfitting. The optimism should be corrected to reflect the stable and unbiased estimation of the model performance in future data. The optimism-corrected performance was then obtained by subtracting the optimism from the apparent performance.
The performance of the model was assessed using the Negelkerke R 2 , AUC, and Hosmer-Lemeshow (HL) test. Negelkerke R 2 is an overall measure to quantify the variability explained by a model. AUC is used to indicate how well the model classifies patients into different risk groups. HL test assessed the calibration of the model, and a p value of greater than 0.05 indicates good agreement between predicted probability and observed risk.  Figure 1 shows the selection frequency for all the variables in the 1000 bootstrap samples with the modeling techniques Stepwise-MLR, LASSO-MLR, and SVM. For the variables with greater than 80% frequency, stepwise selection and LASSO had two common variables: CCRT (stepwise: 80.8%, LASSO: 99.7%) and EUD (stepwise: 93.6%, LASSO: 96.7%). Compared with stepwise selection, LASSO tended to include more variables and it selected V75 with a frequency of 92.4%. Using 60% as a threshold, the variables selected by SVM were CCRT (66.3%) and EUD (69.2%), matching the variable selection based on the stepwise approach. Table 2 presents the regression coefficients for the two MLR models using the selected variables and the optimal values of C and γ for the SVM model.
A summary of the apparent performance, bootstrap performance, optimism, and the optimism-corrected performance of all the models are listed in Table 3. The optimism-corrected Negelkerke R 2 and AUC values of the LASSO-MLR were also higher than the other models. Stepwise-MLR showed a marginal improvement in the optimism-corrected-AUC (0.797) compared with the gLKB model (0.796). A similar predictive performance was found for the SVM and sLKB model with regard to the optimism-corrected Negelkerke R 2 and AUC values. Both of the optimism-corrected Negelkerke R 2 values of the SVM and sLKB models were 0.301, and the optimismcorrected AUC values were 0.783 and 0.784 for the sLKB and SVM models respectively. Moreover, there were minimal differences between SVM and gLKB, Of note, the advanced SVM modeling technique was more prone to overfitting, as indicated by its highest optimism (0.039 and 0.015 for Negelkerke R 2 and AUC respectively). By contrast, the lowest optimism was observed in the conventional sLKB model. Figure 2 displays the receiving operator curves and the calibration plots of all the models. A comparison of receiving operator curves of all the models (Fig. 2a) showed that sLKB was less accurate in detecting true positive cases at a higher threshold as compared with the other models. The predictions of all the models significantly agreed with the observed outcome, as demonstrated by the non-significant HL tests. The loess smoother of LASSO-MLR was closer to the ideal line, indicating a better calibration performance.

Discussion
In this study, we investigated and internally validated the predictive performance of NTCP models developed for radiation-induced esophagitis in NSCLC patients receiving proton radiotherapy using different modeling techniques ranging from conventional analytic solution, logistic regression, to advanced machine learning. Of the five NTCP models, LASSO-MLR showed the best fit for the data. The sLKB model had the lowest optimism. SVM modeling resulted in NTCP models with a good apparent performance but with the highest optimism. Our results highlighted the non-inferiority of the conventional LKB modeling to the advanced SVM modeling in the discriminative ability and the accuracy of the predicted probability.
The predictive performance of the NTCP model was improved when more factors were considered. For the LKB modeling, our results of the comparison of the apparent performance were in line with the studies conducted by Peeter et al. 15 and Defraene et al. 33 , who found the modified LKB model with both the dosimetry and clinical factors gave a significantly better fit than the standard LKB model with dosimetry alone for different endpoints. Previous studies also reported similar findings in multivariable logistic regression modeling that the model with more predictors had a higher performance than the one with fewer predictors 34,35 . However, the comparison of the optimism-corrected performance has not been investigated. Of note is that the optimism-corrected performance was considered a less unbiased assessment of the model in the future data than the apparent performance 36 . It is more likely to have a higher optimism by including more variables, thus resulting in lower optimism-corrected performance. Xu et al. 35 demonstrated that an all-variable logistic regression model was susceptible to overfitting, which suggested a robust variable selection approach. As shown in Table 3, thought the optimisms of gLKB and LASSO-MLR were higher than those of their counterparts using the same modeling technique, the optimism-corrected performance of gLKB and LASSO-MLR were still higher. www.nature.com/scientificreports/ The advanced SVM modeling technique did not offer improvement in our prediction of the dose-response relationship over the conventional LKB and logistic regression models. Moreover, the predictive performance of the SVM model was slightly poorer than the Stepwise-MLR that used the same predictors. The results would probably be explained by the γ parameter of the SVM model. In addition to the regulation parameter C, the behavior of the SVM model using the radial basis function kernel is very sensitive to the γ parameter. A large value of the γ parameter means that the radius of the influence area of the support vectors only includes the support vector itself and the regulation from the parameter C is unable to prevent overfitting. A small value of the γ parameter denotes an overly constrained model that cannot capture the complexity of the data, resulting in a model with a similar performance as the model assuming the linearity of the predictors. The γ parameter in the optimized SVM model has a relatively low value of 2 −13 . Therefore, the SVM model performed comparably to the logistic regression model under the same feature selection and even did not outperform the simplest analytic LKB model. Lynam et al. 26 also reported that logistic regression performed as well as the machine learning algorithm in their study population when a small number of predictors were considered. Currently, there is no golden standard for variable selection and fewer significant variables are preferred in clinical practice. The nonlinear machine learning approaches may prove more powerful in the context of a larger dataset with more www.nature.com/scientificreports/ variables, such as the study of genomics and radiomics. In addition, the signal-to-noise of the clinical data may be lower. Therefore, the advanced machine learning modeling technique may have limited applicability in the current clinical setting.
For clinical practitioners, intuitive interpretability and ease of implementation were two important aspects when choosing the "best" model. The parameter n in the LKB model provides insight into whether the low dose region or the high dose region matters. We can also derive a dose constraint in terms of EUD from the LKB and logistic regression models based on the monotonically increasing functions of EUD, leading toward individualized treatment planning. In contrast, the mapping of input data and the output predictions in the SVM model is considered a "black box", which prevents practitioners from better understanding the data and quantifying the effect of the dosimetric and non-dosimetric predictors. Incorporating the decision-making platform with a large amount of information given by the treatment planning system and the patient information system such as Epic would facilitate the adoption of a machine learning algorithm in clinical decision-making. In terms of performance and utility, LASSO-MLR was the best modeling technique in the current context.
One limitation of the current study is that only a few clinical predictors had been enrolled in the investigation. A further study would be different NTCP modeling techniques incorporated with functional imaging features and biological markers. In addition, larger and external data sets are needed to verify the findings of this study.

Conclusion
In the context of classification and probability estimation for predicting the NTCP of radiation-induced esophagitis in NSCLC patients receiving proton radiotherapy, a multivariable logistic regression model developed with LASSO provided the best predictive results. The simplest LKB modeling using the analytic function had similar or even better predictive performance than the most complex SVM modeling, and it was least likely to overfit the training data. The advanced machine learning approach might have limited applicability in clinical settings with a relatively small amount of data.  Table 3. Apparent, bootstrap performance and optimism of sLKB, gLKB, Stepwise-MLR, LASSO-MLR, and SVM models. *Apparent performance (optimism-corrected). sLKB standard Lyman-Kutcher-Burman, gLKB generalized Lyman-Kutcher-Burman, Stepwise-MLR multivariable logistic regression using stepwise feature selection, LASSO-MLR multivariable logistic regression using least absolute shrinkage and selection operator for feature selection, SVM support vector machine, AUC area under the receiver operator curve, HL Hosmer-Lemeshow, LL log likelihood, AIC Akaike information criterion.

Data availability
The datasets used and/or analysed during the current study available from the corresponding author on reasonable request.